Airbnb Price Prediction



Introduction

Airbnb was started in 2008 by Brian Chesky and Joe Gebbia, and since then, it has gained popularity due to its low prices and direct interactions with the local community. Airbnb is an online home-sharing platform where people can list or rent properties for short-term use. Be it a spare bedroom, an apartment, a villa, a private island, or even a sofa, anyone looking to earn some profit can promote their space on Airbnb.

Airbnb provides its guests with many options and varieties such as the types of apartments, types of rentals, etc. but not a lot of functionality is provided to the hosts/homeowners to determine the optimum price for their listings. Airbnb listing price is a significant factor that hosts must get right. Especially in big cities like Amsterdam, where there is a lot of competition, even the slightest variations in the prices can get the listing priced out of the market. While listing out a property, Airbnb suggests a base price by considering property details, locations, and similar properties in the area. Some third-party services and websites provide general guidance, but none of them are free. Since the holiday letting market is constantly changing, hosts should be diligent in monitoring and adjusting their prices instead of only focusing on the base price initially set by Airbnb.

In this project, we take the homeowner's perspective and try to determine the factors that impact listing prices the most. By doing this, we help hosts determine the optimum nightly price for their listings and maximize their earnings. We decided to work on the Airbnb data of one of the biggest cities, Amsterdam. Unlike in other cities, Airbnb has not been in a favorable position in Amsterdam due to newly established renting regulations and close competition with apartments and other rental services. We will analyze multiple data sets collected from Inside Airbnb, including parameters like availability, neighbourhoods, room types, fees, and more. In addition, we will explore the guest reviews and the impact of the reviews on Airbnb rentals to help hosts make better decisions.

Questions of Interest

How does the accessibility to various amenities affect the price of Airbnb listings?

Amenities offered by the host play a significant role in the occupancy rate of a listing. Guests nowadays prefer basic amenities like Wi-Fi, toiletries, heating, and air conditioning to be included in the listing. We will analyze how the acceptable pricing varies depending on the number of amenities offered, so that hosts can determine which features or amenities they can add to their listing to allow them to charge more.

What are the various factors which affect the reviews? What insights can we gain from them?

Having positive reviews on a particular listing increases its probability of being accepted at a higher price by renters and also increases its occupancy rate. We will analyze which factors end up affecting reviews the most and how the host can effectively use them to improve the overall rating of their listing.

Which areas have the most Airbnb properties, and which are the most expensive?

Locating the most popular rental spots is undoubtedly one of the essential factors for attracting more guests. Usually, there are popular tourist spots or locations near the center of cities where people will pay a higher price to stay. Not only can hosts increase profits by charging a higher fee at these locations, but they can also tune the pricing based on the competition in these areas and create a separate customer base of their own.

How can we help Airbnb hosts to determine the optimum nightly price for their listings?

Setting an optimum nightly rate according to market demand heavily influences the overall occupancy rate of a listing. The pricing strategy can vary depending on the type and characteristics of the listing. So we will use machine learning to help the host determine the most influential features of their listing and how they can optimize the pricing by leveraging this information.

What are the various factors which affect the Airbnb listing?

This project focuses on helping an Airbnb host determine all the ways in which they can make the most profit out of either their existing listing or the factors that a potential host must consider before they publish a listing on Airbnb. This information can be extremely valuable to a host as it would help them understand which factors they should be focusing on more for their existing or potential listings.

Data Acquisition

The datasets used for this project were collected from the website Inside Airbnb, which is an independent, non-commercial, open source data tool. This investigatory website scrapes and reports publicly available information about a city's Airbnb listings. The dataset used in this project was scraped on September 7th, 2021, and it contains information on all Amsterdam listings that were live on the website on that day.

We have used the following datasets from the website:

Even though the compiled data provides a useful basis for examining and monitoring Airbnb practices, it has some critical limitations. The major one is that it only includes the advertised price (sometimes called the sticker price): the overall nightly price that is advertised to potential guests, rather than the actual average amount paid per night by previous guests. The advertised price can be set to any arbitrary amount by the host, and hosts who are less experienced with Airbnb will often set it very low or very high.



Data Processing

The data obtained from Inside Airbnb contained a lot of information and needed to be transformed before it could be used for data visualization and machine learning. We performed the following operations to prepare it:

  1. Data Collection
  2. Data Cleaning
  3. Data Wrangling/Manipulation
  4. Data Transformation

Data Collection - Inside Airbnb Data

We will import all the required libraries before starting the analysis. Then we will load the listings data to a dataframe.

The dataset has 16116 listings and 74 columns for each listing.

Data Cleaning

Free-text columns and other columns that are not useful for predicting price are dropped, and the remaining columns are stored in a new dataframe listings_clean_df, keeping the original data intact in listings_full_df. We will also load the reviews data into a dataframe reviews_full_df, and the calendar data into calendar_full_df.

We will drop the comments column in reviews since we will not be using NLP.

We will check whether any columns in the three dataframes have null values and whether these values will impact our analysis. If the total number of null values in a column is significant, we will drop the column.
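As a rough sketch of this null-value check, the snippet below computes the share of missing values per column and drops the columns above a cutoff. The helper name `drop_sparse_columns`, the 50% cutoff, and the toy dataframe are assumptions made for illustration; the original analysis does not state an exact threshold.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_null_fraction: float = 0.5) -> pd.DataFrame:
    """Drop columns whose share of null values exceeds max_null_fraction."""
    # isnull().mean() gives the fraction of missing values in each column
    null_fraction = df.isnull().mean()
    sparse_columns = null_fraction[null_fraction > max_null_fraction].index
    return df.drop(columns=sparse_columns)


# Toy data standing in for the listings dataframe (values are made up)
demo_df = pd.DataFrame({
    "price": ["$100.00", "$80.00", "$120.00", "$95.00"],
    "bathrooms": [np.nan, np.nan, np.nan, np.nan],         # 100% null
    "host_response_rate": ["90%", np.nan, np.nan, np.nan],  # 75% null
    "bedrooms": [1.0, 2.0, np.nan, 1.0],                    # 25% null
})

cleaned_df = drop_sparse_columns(demo_df)
print(cleaned_df.columns.tolist())
```

A fixed cutoff is only one option; a column with many nulls may still be worth keeping if the missingness itself carries information.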

We drop neighbourhood_group_cleansed and bathrooms from listings since these columns have no data. We also drop host_response_time and host_response_rate since each has almost 70% null values, so they would not be useful in our analysis.

We will drop the calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, and calculated_host_listings_count_shared_rooms columns since they are components of calculated_host_listings_count, which is the sum of these three values.

There are multiple columns with data on the minimum and maximum number of nights a guest is allowed to stay - minimum_nights, maximum_nights, minimum_minimum_nights, maximum_minimum_nights, minimum_maximum_nights, maximum_maximum_nights, minimum_nights_avg_ntm and maximum_nights_avg_ntm. Among them, we will only be retaining the minimum_nights and maximum_nights columns as this data is sufficient for our analysis.

Data Wrangling

Some columns have true or false values. We will convert this categorical data and assign numeric values to them. To understand the distribution of data, we are creating histograms. We will then drop the columns which contain only one value since these columns will not impact our analysis.
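The two wrangling steps described here could look roughly like the following. This sketch assumes the true/false values are stored as the strings 't' and 'f'; the helper names and the sample columns are illustrative and not taken from the original notebook.

```python
import pandas as pd


def encode_boolean_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Map 't'/'f' strings to 1/0 in the given columns (assumed encoding)."""
    encoded = df.copy()
    for column in columns:
        encoded[column] = encoded[column].map({"t": 1, "f": 0})
    return encoded


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that contain only a single distinct value."""
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=constant)


# Toy data (made up); has_availability holds only one value
demo_df = pd.DataFrame({
    "host_is_superhost": ["t", "f", "f"],
    "has_availability": ["t", "t", "t"],
    "accommodates": [2, 4, 2],
})

demo_df = encode_boolean_columns(demo_df, ["host_is_superhost", "has_availability"])
demo_df = drop_constant_columns(demo_df)
print(demo_df)
```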

Data Transformation

Reassigning property types

In the listing data, there are multiple property types assigned to the listings. Some identical property types are treated as different because of the format of the data, and some entries also include the room type (e.g., Private room in rental unit). We will remove the room types and assign new categories to the property types. Some entries do not have any specific property type, so they are changed to 'unknown'.
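One possible way to strip the room type from a property type is shown below. The list of prefixes and the function name `simplify_property_type` are assumptions for illustration; the actual category mapping used in the project is not given in the text.

```python
import pandas as pd

# Assumed room-type prefixes; the real data may contain others
ROOM_TYPE_PREFIXES = ("private room in ", "shared room in ", "room in ", "entire ")


def simplify_property_type(value) -> str:
    """Strip a leading room-type phrase and normalise the remaining label."""
    if pd.isna(value) or not str(value).strip():
        return "unknown"
    text = str(value).strip().lower()
    for prefix in ROOM_TYPE_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    return text or "unknown"


property_types = pd.Series(
    ["Private room in rental unit", "Entire rental unit", "Shared room in houseboat", None]
)
simplified = property_types.apply(simplify_property_type)
print(simplified.tolist())
```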

Formatting host history

All the data in the 'host_since' column should be dates. We will verify the pattern using regex.

Since no entries were in the wrong format, we can convert the column to datetime format. We then convert the column into the number of days that the host has been on the platform. The data was scraped on 2021/09/07, so the host's active days are calculated up to this date.
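A minimal sketch of the check and conversion, assuming host_since uses the YYYY-MM-DD format (the text does not state the exact pattern). The name `host_days_active` is a hypothetical column name.

```python
import pandas as pd

# Scrape date stated in the Data Acquisition section
SCRAPE_DATE = pd.Timestamp("2021-09-07")

# Toy values standing in for the host_since column
host_since = pd.Series(["2012-03-15", "2021-09-06", "2019-12-31"])

# Count entries that do not match the assumed YYYY-MM-DD pattern
malformed = int((~host_since.str.match(r"^\d{4}-\d{2}-\d{2}$")).sum())

# Convert to datetime, then to the number of days up to the scrape date
host_dates = pd.to_datetime(host_since, format="%Y-%m-%d")
host_days_active = (SCRAPE_DATE - host_dates).dt.days
print(malformed, host_days_active.tolist())
```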

Formatting the bathroom_text column

We found that the data in the bathroom_text column is in different formats where the number of bathrooms and type are combined together. Some of the numbers are in numeric format whereas others are descriptive. We will replace the descriptive values and then split them into separate columns with numbers and types of bathrooms.

We will create a dictionary to format the data to "number shared/private bath/baths".

We split the column into bathrooms_number, which stores the number of bathrooms, and bathrooms_type, which stores the type of bathroom, and replace NaN values and entries with no bathroom type with 'unknown'.
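The replacement and split could be sketched as follows. The entries in `DESCRIPTIVE_VALUES` are hypothetical examples of descriptive values; the dictionary actually used in the project is not shown in the text.

```python
import re

import pandas as pd

# Hypothetical replacements that bring descriptive values into the
# "number shared/private bath/baths" format
DESCRIPTIVE_VALUES = {
    "half-bath": "0.5 bath",
    "private half-bath": "0.5 private bath",
    "shared half-bath": "0.5 shared bath",
}

BATH_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)\s*(shared|private)?\s*baths?$")


def split_bathroom_text(value):
    """Return (number, type) parsed from text such as '1.5 shared baths'."""
    if pd.isna(value):
        return (None, "unknown")
    text = str(value).strip().lower()
    text = DESCRIPTIVE_VALUES.get(text, text)
    match = BATH_PATTERN.match(text)
    if match is None:
        return (None, "unknown")
    # Entries without 'shared' or 'private' get the type 'unknown'
    return (float(match.group(1)), match.group(2) or "unknown")


bathroom_text = pd.Series(["1 bath", "1.5 shared baths", "Half-bath", "2 private baths", None])
parsed = bathroom_text.apply(split_bathroom_text)
bathrooms_df = pd.DataFrame(parsed.tolist(), columns=["bathrooms_number", "bathrooms_type"])
print(bathrooms_df)
```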

Process amenities available in listings

The following function returns a list of unique amenities in the series passed to it.
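A function of this kind might look like the sketch below. It assumes each entry of the amenities column is a JSON-style list stored as a string, such as '["Wifi", "Heating"]'; the function name `unique_amenities` is illustrative.

```python
import json

import pandas as pd


def unique_amenities(amenities: pd.Series) -> list:
    """Return the sorted unique amenities found across all listings."""
    found = set()
    for raw in amenities.dropna():
        # Each entry is assumed to be a JSON list encoded as a string
        found.update(json.loads(raw))
    return sorted(found)


amenities = pd.Series(['["Wifi", "Heating"]', '["Wifi", "Elevator"]', "[]"])
all_amenities = unique_amenities(amenities)
print(all_amenities)
```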

In addition to the differences in the importance of amenities, there are also great differences in the frequency of amenities. Some amenities are almost necessary (extremely frequent), while others are very rare.

In this project, amenities are classified and extracted, and only relatively important amenities or amenity groups are kept. At this stage, the importance of an amenity does not come from any mathematical calculation but is based on common sense and everyday reasoning. The extracted amenities will be further studied and screened in the next stage. Amenity groups that appear in too few or too many listings will not be used as price-influencing factors because they do not provide sufficient variation.

It can be seen that more than 97% of properties are equipped with internet amenities, so the internet cannot provide effective difference information. Therefore, we do not take the internet column as a valid column and drop it.

We will create a new column called score, whose value is the number of the amenity groups selected above that a listing offers. For example, if a listing's amenities contain only Elevator and Smart lock, its score will be 2.
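Assuming each selected amenity group is stored as a boolean column, the score is a row-wise sum. The column names below are borrowed from the amenity groups discussed later in the report, and the data is made up.

```python
import pandas as pd

# One boolean column per selected amenity group (toy data)
amenity_flags = pd.DataFrame({
    "elevator": [True, False, True],
    "secure": [True, False, False],
    "gym": [False, False, True],
    "breakfast": [False, False, True],
})

amenity_columns = ["elevator", "secure", "gym", "breakfast"]

# Summing booleans across a row counts how many amenity groups a listing has
amenity_flags["score"] = amenity_flags[amenity_columns].sum(axis=1)
print(amenity_flags["score"].tolist())
```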

Process review ratings

For listings without reviews, the rating will be replaced with 'no reviews'. The remaining ratings will be grouped into bins. To determine useful bins, we create histograms that display the distribution of ratings for all the review rating columns.

From the above histograms we can see that most of the ratings are 4/5 or 5/5. Therefore, the ratings 4/5 and 5/5 will be separately grouped and all the remaining ratings will be grouped together into a single bin.
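The binning could be sketched as below. The bin edges (4.5 and 3.5 on a 0 to 5 scale) and the label '0-3/5' are assumptions; the text only states that 4/5 and 5/5 get their own bins and that all other ratings share one.

```python
import numpy as np
import pandas as pd


def bin_rating(rating) -> str:
    """Assign a rating on a 0-5 scale to one of four groups."""
    if pd.isna(rating):
        return "no reviews"
    if rating >= 4.5:   # assumed cutoff for the 5/5 bin
        return "5/5"
    if rating >= 3.5:   # assumed cutoff for the 4/5 bin
        return "4/5"
    return "0-3/5"


review_scores_rating = pd.Series([4.9, 4.2, 2.0, np.nan])
rating_bins = review_scores_rating.apply(bin_rating)
print(rating_bins.tolist())
```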

Data Dictionary

Below is the list of all columns and a short description of the data each column contains from the final processed dataframe.



Data Analysis and Visualization

Q1 How does the accessibility to various amenities affect the price of Airbnb listings?

We think that the number and type of amenities may be factors that affect price. Perhaps a host can charge a higher price if they provide a gym, or if they provide many amenities.

We change the data type of price from string to float. First we remove the '$', and since some values contain ',' (for example, 8,000), we remove the ',' as well. The new column is named 'price_format'.
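A short sketch of this conversion with pandas string methods; the sample prices are made up.

```python
import pandas as pd

listings = pd.DataFrame({"price": ["$85.00", "$8,000.00", "$120.00"]})

# Remove the currency symbol and thousands separators, then cast to float
listings["price_format"] = (
    listings["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(listings["price_format"].tolist())
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.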

To check the distribution of the price, we draw a box plot of price_format. The box plot shows that most prices fall between 0 and 1000, with a lot of outliers, so we use the where function to cap the price. We consider any price over 500 to be high, so we set prices over 500 to 500 to generate more meaningful plots.
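The capping step can be expressed with `Series.where`, which keeps values where the condition holds and substitutes the second argument elsewhere. The variable names and sample values are illustrative.

```python
import pandas as pd

price_format = pd.Series([85.0, 8000.0, 120.0, 500.0, 650.0])

# Keep prices up to 500 as they are; replace anything higher with 500
price_capped = price_format.where(price_format <= 500, 500)
print(price_capped.tolist())
```

`price_format.clip(upper=500)` would give the same result.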

To check the result, we draw the box plot again.

We then apply the same price processing to the df_amenities dataframe.

To see whether having an amenity affects price, we used a for loop to generate a box plot for each amenity. The results can be seen by running the commented-out code below.

Inference

In the above plots, we can see whether having a specific amenity results in higher prices by comparing the True and False boxes. The higher the box, the higher the prices.

We can conclude the following. The amenities associated with higher prices are: recreation, air condition, cook, work, gym, parking, children, secure.

The amenities that are not associated with higher prices are: elevator, breakfast.

So we can suggest that if a listing has recreation, air condition, cook, work, gym, parking, children, or secure, the host can set a higher price.

Also, among these amenities, secure and children show the biggest price difference between listings that have them and listings that do not. So if a host wants to open a new listing and can provide these amenities, they can price it higher.

On the other hand, we found that prices with and without breakfast were similar. So we suggest that hosts not provide breakfast, because it is costly.

We think that having more amenities may lead to a higher price. To check this, we generated a box plot that shows the relationship between the number of amenities and the price.

Inference

We found that as the number of amenities increases, the box also gets higher. This means that a host can charge a higher price if they provide many amenities. But score 9 has lower prices, so we need to take a look at the data.

Since only two data points have nine amenities, we think that other factors may have had a stronger effect on their price than the number of amenities.

Q2 What are the various factors which affect the reviews?

Most people look at the reviews of a listing before making the final decision. So we will try to get an overview of how the reviews are distributed, see if there is any noticeable pattern in how guests rate a listing, and draw insights from these to help the host.

If we look at the distribution of the feature 'review_scores_rating', which is the overall rating of a particular listing, we can see that most of the ratings fall in the 5/5 bin, which corresponds to a rating of 80% or more. So guests have had good experiences with Airbnb listings in Amsterdam so far.

We will look into the reviews based on individual criteria to see if any meaningful insight can be extracted from them. There are a total of seven review columns: description accuracy, cleanliness, check-in, communication, location, value for money, and the overall rating. After plotting the rating columns, we can see that guests give comparatively less positive feedback on cleanliness, value, and location.

Q3 Which areas have the most Airbnb properties, and which are the most expensive?

We use the price filter we created to plot a map. The darker the dot, the higher the price.

In the above plot, specific areas have darker dots, which means those areas have higher prices. The areas with higher prices are Amsterdam Marina, Tolhuistuin, Vondelpark, and Sportpark De Eendracht.

Among these places, there are only a few listings near Sportpark De Eendracht, so we suggest that hosts open new listings in that area and price them higher.

Q4 How can we help Airbnb hosts to determine the optimum nightly price for their listings?

Since the prediction target, price, is a continuous value, we should use a regression model rather than a classification model.

Data Pre-Processing for Machine Learning

Show data type attributes

Correlation Analysis and Redundancies

Create a heatmap of the correlation matrix, in which lighter colors indicate higher correlation.

Drop irrelevant features

Drop features with high correlation

There are three lighter clusters in the above heatmap, which means there are three groups of features that are highly correlated with one another.

For the upper left cluster ('accommodates', 'beds', 'bedrooms'), we keep 'accommodates'.

For the middle cluster('availability_30','availability_60','availability_90','availability_365'), we keep 'availability_365'.

For the bottom cluster('review_scores_accuracy', 'review_scores_cleanliness','review_scores_checkin', 'review_scores_communication', 'review_scores_value', 'review_scores_location'), we keep 'review_scores_value' and 'review_scores_location'.
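The clusters read off the heatmap can also be listed programmatically. The sketch below uses synthetic data and an assumed threshold of 0.8; the helper `highly_correlated_pairs` is illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    columns = corr.columns
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if corr.iloc[i, j] > threshold:
                pairs.append((columns[i], columns[j]))
    return pairs


# Synthetic features: beds is built to track accommodates closely
rng = np.random.default_rng(0)
accommodates = rng.integers(1, 9, size=200).astype(float)
features = pd.DataFrame({
    "accommodates": accommodates,
    "beds": accommodates + rng.normal(0, 0.3, size=200),
    "availability_365": rng.integers(0, 366, size=200).astype(float),
})

pairs = highly_correlated_pairs(features)
print(pairs)

# Keep one representative per correlated group, as described above
reduced = features.drop(columns=["beds"])
```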

Drop nan value

Outliers and Skewness Detection

Filter the dataset on price: a price of 0 is obviously abnormal and not meaningful, so such listings are removed.

Eliminate right skewness

Eliminate the right skewness with log transformation.
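As an illustration of the effect, the snippet below applies a natural log to a small made-up price series and compares the skewness before and after. Whether the original notebook used `np.log` or a variant such as `np.log1p` is not stated; `np.log` is assumed here, which is safe once zero prices have been filtered out.

```python
import numpy as np
import pandas as pd

# Made-up prices with one large value creating a long right tail
price = pd.Series([40.0, 80.0, 120.0, 150.0, 200.0, 950.0])

log_price = np.log(price)

print("skew before:", price.skew())
print("skew after: ", log_price.skew())

# Predictions made on the log scale can be mapped back with np.exp
recovered = np.exp(log_price)
```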

Encoding data with one-hot encoder

A one-hot vector is a 1 × N vector used to distinguish each category of a variable from the others. The vector consists of 0s in all cells with the exception of a single 1 in the cell that uniquely identifies the category.

For columns with a categorical data type (str), we use a one-hot encoder to transform them into numerical values so that the algorithms can process them.
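A compact way to illustrate one-hot encoding is `pandas.get_dummies`, used here in place of whichever encoder the original notebook used. The sample rows are made up.

```python
import pandas as pd

listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "accommodates": [4, 2, 3],
})

# Each category of room_type becomes its own 0/1 column
encoded = pd.get_dummies(listings, columns=["room_type"], dtype=int)
print(encoded)
```

When the encoding has to be fitted on training data and reapplied to unseen data, an encoder object such as scikit-learn's `OneHotEncoder` is usually the more convenient choice.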

Modeling

Separating and Sampling

Linear Regression Model

Result (plot and statistics)

Relationship between predictions and actual values graphically with a scatter plot
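The split, fit, and evaluation steps for the linear model can be sketched with scikit-learn as follows. The data is synthetic, and the 80/20 split and random seed are assumptions; the original notebook's settings are not given in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and the log price
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 4.5 + X @ np.array([0.6, -0.3, 0.1]) + rng.normal(scale=0.1, size=300)

# Hold out 20% of the rows for testing (assumed ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# RMSE: square root of the mean squared prediction error
rmse = float(np.sqrt(mean_squared_error(y_test, predictions)))
print(f"RMSE: {rmse:.3f}")
```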

Random Forest Regressor

Result (plot and statistics)

Relationship between predictions and actual values graphically with a scatter plot

Hyperparameter Tuning

Use RandomizedSearchCV to find optimized hyperparameters, including n_estimators, max_features, max_depth, and min_samples_leaf.
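A minimal sketch of such a search is shown below. The candidate values, the number of iterations, and the synthetic data are assumptions chosen to keep the example fast; they are not the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the prepared listings
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

# Assumed candidate values for the four hyperparameters named above
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": [1.0, "sqrt"],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                               # sample 5 random combinations
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
print("best CV RMSE:", -search.best_score_)
```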

Result (plot and statistics)

Show relationship between predictions and actual values graphically with a scatter plot

Neural Network

Result (plot and statistics)

Relationship between predictions and actual values graphically with a scatter plot

Loss changes

Plot the loss changes during training, to find whether the model converged.

Inference

We choose the Random Forest model as the final prediction model, because its RMSE is the smallest, which means that the prediction error is the smallest. For neural network models, more complex structures and parameters can be further adjusted in the future.

Q5 What are the various factors which affect the Airbnb listing?

Feature importance

One advantage of using decision tree methods is that they can automatically provide an estimate of feature importance from the trained prediction model.

The importance score indicates how valuable each feature was in building the decision trees in the model. The more often an attribute is used to make key decisions in the trees, the higher its relative importance.
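With scikit-learn, a fitted random forest exposes these scores through its `feature_importances_` attribute. The sketch below uses synthetic data in which one feature is built to dominate the target; the feature names are borrowed from the report for readability only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mostly on accommodates
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=300).astype(float),
    "score": rng.integers(0, 10, size=300).astype(float),
    "noise": rng.normal(size=300),
})
y = 0.5 * X["accommodates"] + 0.05 * X["score"] + rng.normal(scale=0.05, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances, normalised to sum to 1
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure of effect.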

Inference



Conclusion

After analyzing the Airbnb data of Amsterdam, we can say that a higher number of amenities increases the probability of a listing being accepted at a higher price. We also identified which amenities have the most influence on pricing. We did a customer review analysis to identify the factors (cleanliness, location, value for money) that hosts can improve on to gain a competitive advantage, as well as the zones they can target to maximize the profits from their listings while increasing the overall occupancy rate. Finally, after considering all of these, we compared multiple price prediction models and selected the Random Forest model as the final prediction model, because its RMSE is the smallest, which means that its prediction error is the smallest. This prediction model can help hosts determine an optimal price for their listings.


